Abstract

In this paper, we investigate the hypothesis that plan recognition
can significantly improve the performance of a case-based
reinforcement learner in an adversarial action selection task. Our
environment is a simplification of an American football game. The
performance task is to control the behavior of a quarterback in a
pass play, where the goal is to maximize yardage gained. Plan
recognition focuses on predicting the play of the defensive team. We
modeled plan recognition as an unsupervised learning task, and
conducted a lesion study. We found that plan recognition was
accurate, and that it significantly improved performance. More
generally, our studies show that plan recognition reduced the
dimensionality of the state space, which allowed learning to be
conducted more effectively. We describe the algorithms, explain the
reasons for performance improvement, and also describe a further
empirical comparison that highlights the utility of plan recognition
for this task.

1.  Motivation  and  Contributions 

Large state spaces pose a challenge for reinforcement learning (RL)
algorithms due to the amount of data required to develop accurate
action-selection policies. For example, when using the observed
state variables, the performance task that we analyze in this paper
has a large state space (4.3 × 10^9), which is common for
adversarial multiagent environments. Due to this and other
characteristics of our task, if we used a simple Q-learning
algorithm, then learning an accurate policy would require an
inordinately large (and practically infeasible) number of trials.

Case-based reasoning (CBR) methods are an attractive approach for
solving this problem because they assume that the same (or similar)
actions are best performed among a given set of similar states. When
this assumption holds, generalizing from previous experiences can
greatly reduce the number of states that need to be visited during
trials to learn an accurate policy. Also, CBR methods are
comparatively simple to encode, intuitive, and have a good
performance record for assisting with reinforcement learning (e.g.,
Ram & Santamaria, 1997; Sharma et al., 2007; Molineaux et al., 2008).


Copyright © 2009, Association for the Advancement of Artificial
Intelligence (www.aaai.org). All rights reserved.


Unfortunately, CBR methods are not a panacea; they provide only one
part of a solution to this problem. For example, like reinforcement
learning algorithms (Barto & Mahadevan, 2003), they learn slowly
when state descriptions have high dimensionality because this
complicates the task of identifying similar cases. Thus, they can
benefit from techniques that reformulate the state space to address
this problem (e.g., Aha, 1991; Fox & Leake, 2001).

One method for reformulating the state space involves using plan
recognition (Sukthankar, 2007) to reveal hidden variables (e.g.,
concerning opponent intent), which can then be incorporated into the
state space used by learning algorithms. This has the potential to
transform a partially observable environment into a fully observable
environment (Russell & Norvig, 2003).

We investigate the utility of a plan recognition method for
reformulating the state space of a case-based reinforcement learning
algorithm so as to improve its performance on a complex simulation
task. We claim that plan recognition can significantly increase
long-term rewards on this task, describe an algorithm and an
empirical study that support this claim, hypothesize a reason for
the algorithm’s good performance, and report on a subsequent
investigation of that hypothesis.

Section 2 describes our task environment, which is a limited
American football game simulation. We then describe related work in
Section 3 before introducing our case-based reinforcement learner,
and its plan recognition extensions, in Section 4. Our empirical
study, results analysis, and subsequent investigation are described
in Section 5. We discuss these in Section 6 and conclude in Section 7.


2.  Domain  and  Performance  Task 

American football¹ is a game of skill played by two teams on a
rectangular field. Rush 2005² is an open-source American football
simulator whose teams have only 8 players and whose field is 100x63
yards. We use a variant of Rush that we created (RUSH 2008³) for our
investigation.


1  http://en.wikipedia.org/wiki/American_Football 

2  http://rush2005.sourceforge.net/ 

3  http://www.knexusresearch.com/projects/rush
Figure 1: Our study’s starting Rush 2008 formation for the offense
(blue) and defense (red), with some player/position annotations.


Our investigation involved learning to control the quarterback’s
actions on repeated executions of the same offensive play (a pass).
In this context, the offensive team’s players are instructed to
perform the following actions:

Quarterback (QB): Given the ball at the start of play while standing
3 yards behind the center of the line of scrimmage (LOS), our QB
agent decides whether and when he runs, stands, or throws (and to
which receiver).

Running Back (RB): Starts 3 yards behind the QB, runs a pass route
7 yards left and 4 yards downfield.

Wide Receiver #1 (WR1): Starts 16 yards to the left of the QB on the
LOS, runs 5 yards downfield and turns right.

Wide Receiver #2 (WR2): Starts 16 yards to the right of the QB a few
yards behind the LOS, runs 5 yards downfield, and waits.

Tight End (TE): Starts 8 yards to the right of the QB on the LOS and
pass-blocks.

Offensive Linemen (OL): These 3 players begin on the LOS in front of
the QB and pass-block (for the QB).

In our investigation, the defense always used plays starting from
the same formation, with its players acting as follows:

Defensive Linemen (DL): These 2 players line up across the LOS from
the OL and try to tackle the ball handler.

Linebackers (LB): These 2 players start behind the DL, and will
blitz the QB, guard a particular zone of the field, or guard an
eligible receiver (i.e., the RB, WR1, or WR2), depending on the play.

Cornerbacks (CB): These 2 players line up across the LOS from the
WRs and guard a player or a zone on the field.

Safeties (S): These 2 players begin 10 yards behind the LOS and
assist with pass coverage or chase offensive players.

Figure  1  displays  the  starting  formation  we  used  for  both 
teams  in  each  play.  All  players  pursue  a  set  play  based  on 
their  specific  instructions  (i.e.,  for  the  offense,  a  location  to 
run  to  followed  by  a  behavior  to  execute),  except  for  the 
QB,  whose  actions  are  controlled  by  our  learning  agent. 
However,  due  to  the  stochastic  nature  of  the  simulator,  the 
play  does  not  unfold  the  same  way  each  time.  Each  player 
possesses  unique  skills  (specified  using  a  10-point  scale) 
including  power,  speed,  and  skill;  these  affect  his  ability  to 
handle the ball, block, run, and tackle other players. The
probability that a passed ball is caught is a function of the number
of defenders near the intended receiver, the skills of the receiver
and the nearby defenders (if any), and the distance the ball was
thrown.

Figure 2: Six of the QB’s eight possible actions. Not shown are
Throw RB and Noop.

The physics of the simulator are simplified. When a player or the
ball starts to move, it takes on a constant velocity, with the
exception that the ball will accelerate downwards due to gravity.
All objects are represented as rectangles that interact when they
overlap (resulting in a catch, block, or tackle).

Within Rush, we examine the task of learning how to control the
quarterback’s actions so as to optimize yardage gained on a single
(repeated) play. At the start of each play, the defense secretly and
randomly chooses one of five plays/strategies that begin from the
same known formation. These plays are named “Half-and-Half”, “Soft
Covers”, “Pass Blanket”, “Hard Blitz”, and “Pressure RB”. The
offensive team always uses the same passing play, as detailed above.
Only the QB is controlled. The other players’ actions are slightly
variable, and they may not run the same path every time, even though
they will follow the same general directions. This task is
stochastic because the other players’ actions are random within
certain bounds. It is also partially hidden: while each player’s
position and movements are visible, one of the determinants of those
movements (i.e., the defensive strategy) is not observable.

The QB can perform one of eight actions (see Figure 2) at each time
step during the offensive play. The first four, Forward, Back, Left,
and Right, cause the QB to move in a certain direction for one time
step. Three more cause the QB to pass to a receiver (who is running
a pre-determined pass route): Throw RB, Throw WR1, and Throw WR2.
Finally, one action causes the quarterback to stand still for a time
step: Noop. The QB may decide to run the football himself. The
quarterback must choose actions until he throws the ball, crosses
into the end zone (i.e., scores a touchdown by gaining 50 yards from
the LOS), or is tackled. If the QB passes, no more actions are
taken, and the play finishes as soon as an incompletion occurs, an
interception occurs, or the successful receiver has been tackled or
scores a touchdown.

At the start of each play, the ball is placed at the center of the
line of scrimmage (LOS) along the 50 yard line. The agent’s reward
is 1000 for a touchdown (i.e., a gain of at least 50 yards), -1000
for an interception or fumble, or is otherwise ten times the number
of yards gained (e.g., 0 for an incomplete pass) when the play ends.
A reward of 0 is received for all actions before the end of the
play. Touchdowns, interceptions, and fumbles are relatively rare.
Touchdowns occur between 0.01% of the time (for a low performer) and
0.2% of the time (for a high performer). Interceptions and fumbles
combined occur between 1% and 3% of the time.
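
The reward scheme just described can be summarized in a few lines. The following is a minimal sketch, not the RUSH 2008 implementation; the function name `play_reward` and the outcome labels are hypothetical and were chosen only for illustration.

```python
def play_reward(outcome: str, yards_gained: float) -> float:
    """Terminal reward for one play, following the scheme in the text.

    The outcome labels ("interception", "fumble", "touchdown") are
    hypothetical; any other label is treated as an ordinary end of play.
    """
    if outcome in ("interception", "fumble"):
        return -1000.0
    if outcome == "touchdown" or yards_gained >= 50:
        return 1000.0
    # Otherwise: ten times the yardage gained (0 for an incomplete pass).
    return 10.0 * yards_gained
```

All non-terminal steps would receive a reward of 0, so only the final outcome of the play distinguishes one action sequence from another.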

3.  Related  Work 

Plan  recognition  concerns  the  task  of  inferring  the  goals  of 
an  agent  and  their  plan  for  achieving  them  (Carberry, 
2001).  Ours  is  a  simple  instantiation  of  this  in  which  we 
know  the  opponents’  goals  (i.e.,  minimize  yardage  gained 
and  gain  possession  if  possible),  and  few  plans  are  used. 

Plan recognition has a long history in CBR research (e.g., Kass,
1991), particularly in the context of adversarial, real-time
multiagent games. For example, Fagan and Cunningham (2003) acquire
cases (state-action planning sequences) for predicting a human’s
next action while playing Space Invaders™. We instead focus on
predicting the actions of a team of coordinating players. Cheng and
Thawonmas (2004) propose a case-based plan recognition approach for
assisting players with low-level management tasks in WARGUS.
However, they do not observe the adversary’s tactical movements,
which is our focus. Finally, Lee et al. (2008) use Kerkez and Cox’s
(2003) technique to create an abstract state, which counts the
number of instances of each type-generalized state predicate. On a
simplified WARGUS task, their integration of CBR with a simple
reinforcement learner performs much better when using the abstract
state representation to predict opponent actions. While our approach
also performs state abstraction, our states are not described by
relational predicates, and this technique cannot be applied to our
task.

Several additional CBR researchers have recently investigated
planning techniques in the context of real-time simulation games
(e.g., Aha et al., 2005; Ontanon et al., 2007; Sugandh et al.,
2008). While some employed reinforcement learning algorithms (e.g.,
Sharma et al., 2007; Molineaux et al., 2008; Auslander et al.,
2008), none leveraged plan recognition techniques.

CBR is frequently used in team simulation games such as RoboCup
Soccer (e.g., Karol et al., 2003; Srinivasan et al., 2006; Ros et
al., 2007). Unlike our own, these efforts have not focused on plan
recognition or on alternative approaches for learning a state
representation to enhance reinforcement learning behavior. Among
more closely related work, Wendler and Bach (2003) report excellent
results for a CBR algorithm that predicts agent behaviors from a
pre-defined set. We instead use plan recognition to assist
reinforcement learning, and our opponent’s behaviors are learned via
clustering. Finally, Steffens (2005) examines the utility of adding
virtual features that model the opponent’s team and shows that, when
weighted appropriately, these features can significantly increase
player prediction accuracies. However, these features were
hand-coded rather than learned via a plan recognition method.

There has been limited use of clustering to assist with plan
recognition in related tasks. For example, Riley et al. (2002) use a
clustering technique based on fitting minimal rectangles to player
logs of RoboCup simulator league data to identify player home areas.
A player’s home area is defined as the segment of the field where
the player spends 90% of the game time. However, knowing a player’s
home area is insufficient to perform state-space reduction. In
general, our use of an EM clustering approach for plan recognition
is unusual; most related research focuses on determining which plan
is being executed rather than the plan’s cluster/category.

Finally, we recently reported successful results when using a
supervised plan recognition approach to predict the offensive team’s
play (Shore et al., submitted), but we did not use it as leverage in
a subsequent learning task, which is the focus of this paper.

4.  Algorithm 

Our algorithm is based on the Q(λ) algorithm (Sutton & Barto, 1998);
it uses a set of case bases to approximate the Q function and an EM
clustering algorithm to add opponent plan information to the state.
We call it Case-Based Q-Lambda with Plan Recognition (CBQL-PR).

4.1  Plan  Recognition  Task 

In CBQL-PR, plan recognition is an online learning task that
clusters the observable movements of all the defensive players into
groups. The perceived movement m ∈ M for each defensive player is
the direction that player is moving during a time step, which has
nine possible values:

M = {None, Forward, Left, Right, Back, Forward-Right, Forward-Left,
Back-Right, Back-Left}

Directions are geocentric; Forward is always in the direction of
play (downfield), and all other directions are equally spaced at
45-degree angles. Clustering is performed after the third time step
of each play, so three “snapshots” of the defensive players’
movements are used. Thus, 24 features are used to represent
defensive plays (i.e., the directions on each of three steps for
each of eight defensive players). For the first 1000 trials,
examples were added to the batch to be clustered, but the predicted
cluster (i.e., the recognized plan) was not used in action
selection.
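
To make the 24-feature representation concrete, the sketch below flattens three snapshots of eight defender directions into a single feature vector. It is an illustration under our own assumptions: the names `MOVES`, `encode_play`, and the ordering of features (snapshot by snapshot, defender by defender) are hypothetical and are not taken from the CBQL-PR implementation.

```python
from typing import List, Sequence

# The nine movement values of the set M defined in the text.
MOVES = ["None", "Forward", "Left", "Right", "Back",
         "Forward-Right", "Forward-Left", "Back-Right", "Back-Left"]

NUM_DEFENDERS = 8   # defensive players observed
NUM_SNAPSHOTS = 3   # time steps observed before clustering


def encode_play(snapshots: Sequence[Sequence[str]]) -> List[str]:
    """Flatten 3 snapshots x 8 defender directions into 24 features."""
    if len(snapshots) != NUM_SNAPSHOTS:
        raise ValueError("expected one snapshot per observed time step")
    features: List[str] = []
    for step in snapshots:
        if len(step) != NUM_DEFENDERS:
            raise ValueError("expected one direction per defender")
        for move in step:
            if move not in MOVES:
                raise ValueError(f"unknown movement: {move}")
            features.append(move)
    return features
```

Because every feature is a nominal value drawn from M, a clustering method that models per-feature value frequencies is a natural fit for this encoding.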

We used the Expectation-Maximization (EM) algorithm from the Weka¹
suite of machine learning software for clustering. EM iteratively
chooses cluster centers and builds new clusters until the centers
move only marginally between iterations. Membership of an example in
a cluster is calculated as the product of the within-cluster
frequencies of each value in the feature vector. EM also increases
the number of clusters to discover until successive steps decrease
the average log-likelihood of instances in a final clustering. We
selected EM after reviewing several algorithms; the clusters it
found matched the defensive plays over 99% of the time after fewer
than 1000 examples.
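
The membership computation described above can be sketched as follows. This is not Weka’s EM implementation: it covers only the scoring of an example against clusters whose members are already known, and the helper names (`fit_frequencies`, `membership`, `predict_cluster`) are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Sequence

FrequencyTables = List[Dict[str, float]]


def fit_frequencies(members: Sequence[Sequence[str]]) -> FrequencyTables:
    """Relative frequency of each value, per feature, within one cluster."""
    n = len(members)
    num_features = len(members[0])
    tables: FrequencyTables = []
    for j in range(num_features):
        counts = Counter(example[j] for example in members)
        tables.append({value: c / n for value, c in counts.items()})
    return tables


def membership(example: Sequence[str], tables: FrequencyTables) -> float:
    """Product of the within-cluster frequencies of the example's values."""
    score = 1.0
    for value, table in zip(example, tables):
        score *= table.get(value, 0.0)  # unseen value contributes 0
    return score


def predict_cluster(example: Sequence[str],
                    clusters: Dict[str, FrequencyTables]) -> str:
    """Name of the cluster with the highest membership score."""
    return max(clusters, key=lambda name: membership(example, clusters[name]))
```

A full EM procedure would alternate this scoring step with re-estimation of the frequency tables, and would typically smooth the frequencies so that a single unseen value does not force a score of zero.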


1  http://www.cs.waikato.ac.nz/ml/weka/ 


4.2  Action  Selection  Task 

CBQL-PR periodically selects an action that either maximizes the
expected return (exploiting learned knowledge), or improves its
knowledge of the value space (exploring the environment) so as to
maximize the long-term reward.

CBQL-PR uses a set of case bases to approximate the standard RL Q
function, which maps state-action pairs to an estimate of the
long-term reward for taking an action a in a state s. There is one
Q_a case base in this set for each action a ∈ A, where A is the set
of actions defined by the environment. These case bases support a
case-based problem solving process consisting of a cycle of case
retrieval, reuse, revision, and retention (Aamodt & Plaza, 1994).
For faster retrieval, we use kd-trees to index cases. At the start
of each experiment, each Q_a case base is initialized to the empty
set; cases are added and modified as new experiences are gathered,
which provide new local estimates of the Q function. Cases in Q_a
are of the form <s, v>, where s is a feature vector describing the
state (it contains a combination of integer, real, and symbolic
values) and v is a real-valued estimate of the reward obtained by
taking action a in state s and then pursuing the current
approximation of the optimal policy until the task terminates.

At each time step, a state is observed by the agent, and an action
is selected. With probability ε, a random action will be chosen
(exploration). With probability 1-ε, the algorithm will predict the
best action to take (exploitation). To do this, it reuses each Q_a
case base by performing a locally-weighted regression using a
Gaussian kernel on the retrieved k nearest neighbors of the current
observed state s. Similarity is computed using a normalized
Euclidean distance function. This produces an estimate of the value
of taking action a in the current observed state s. CBQL-PR selects
the action with the highest estimate, or a random action if any case
base has fewer than 7 nearest neighbors.
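
The selection procedure can be sketched as below. Several simplifications are our own assumptions rather than details from the paper: states are purely numeric and already normalized, neighbors are found by a linear scan instead of a kd-tree, the locally-weighted regression is reduced to a kernel-weighted average, and the kernel `bandwidth` parameter is hypothetical.

```python
import math
import random
from typing import Dict, List, Optional, Sequence, Tuple

Case = Tuple[Sequence[float], float]  # (state vector s, value estimate v)


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance; features are assumed to be pre-normalized."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def estimate(case_base: List[Case], state: Sequence[float], k: int,
             bandwidth: float = 1.0) -> Optional[float]:
    """Gaussian-kernel-weighted average of the k nearest cases.

    Returns None when the case base holds fewer than k cases.
    """
    if len(case_base) < k:
        return None
    nearest = sorted(case_base, key=lambda c: distance(c[0], state))[:k]
    weights = [math.exp(-(distance(s, state) / bandwidth) ** 2)
               for s, _ in nearest]
    total = sum(weights)
    if total == 0.0:  # all neighbors very far away: use a plain mean
        return sum(v for _, v in nearest) / k
    return sum(w * v for w, (_, v) in zip(weights, nearest)) / total


def select_action(case_bases: Dict[str, List[Case]], state: Sequence[float],
                  epsilon: float, k: int, rng=random) -> str:
    """Epsilon-greedy choice over one case base per action."""
    actions = list(case_bases)
    if rng.random() < epsilon:
        return rng.choice(actions)            # exploration
    estimates = {a: estimate(case_bases[a], state, k) for a in actions}
    if any(v is None for v in estimates.values()):
        return rng.choice(actions)            # too few neighbors somewhere
    return max(actions, key=lambda a: estimates[a])  # exploitation
```

Falling back to a random action whenever any case base is too sparse keeps the agent exploring early in training, before the value estimates are meaningful.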

Once that action is executed, a reward r and a successor state s'
are obtained from the RUSH 2008 environment. This reward is used to
improve the estimate of the Q function. If the case is sufficiently
novel (more than a distance τ from its nearest neighbor), a new case
is retained in Q_a with state s and

$$v = r + \gamma \max_{a \in A} Q_a(s'),$$

where Q_a(·) denotes the current estimate for a state in Q_a and
0<γ<1 is the discount factor. This update stores an estimate of the
value of taking action a in state s based on the known value of the
best successor state and action. If the case is not sufficiently
novel, the 7 nearest neighbors are revised according to the current
learning rate α and their contribution β to the estimate of the
state’s value (determined by a normalization over the Gaussian
kernel function, summing to 1). The solution of each case is updated
using:

$$v = v + \alpha \beta \left[ r + \gamma \max_{a \in A} Q_a(s') - Q_a(s) \right].$$

Finally, the solutions (values) of all cases updated earlier in the
current trial are updated according to their λ-eligibility:

$$v = v + (\gamma \lambda)^{t} \alpha \beta \left[ r + \gamma \max_{a \in A} Q_a(s') - Q_a(s) \right],$$

where t is the number of steps between the earlier use and the
current update, and 0<λ<1 is the trace decay parameter.
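
The update rules above translate almost directly into code. The sketch below operates on plain lists of values rather than on case bases; the function names (`td_error`, `revise`, `revise_trace`) and the trace layout are hypothetical and serve only to illustrate the arithmetic.

```python
from typing import List, Tuple


def td_error(reward: float, gamma: float,
             best_next_value: float, current_value: float) -> float:
    """Bracketed term: r + gamma * max_a Q_a(s') - Q_a(s)."""
    return reward + gamma * best_next_value - current_value


def revise(values: List[float], contributions: List[float],
           alpha: float, delta: float) -> List[float]:
    """v <- v + alpha * beta * delta for each retrieved neighbor.

    `contributions` holds each neighbor's beta and should sum to 1.
    """
    return [v + alpha * beta * delta
            for v, beta in zip(values, contributions)]


def revise_trace(trace: List[Tuple[float, float, int]], alpha: float,
                 gamma: float, lam: float, delta: float) -> List[float]:
    """v <- v + (gamma * lambda)^t * alpha * beta * delta.

    Each trace entry is (value, beta, t), where t is the number of
    steps since that case was last used in the current trial.
    """
    return [v + (gamma * lam) ** t * alpha * beta * delta
            for v, beta, t in trace]
```

With t = 0 the eligibility factor equals 1, so the trace update reduces to the ordinary revision of the current neighbors.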


4.3  State  Definitions 

In addition to CBQL-PR, we investigate two non-clustering variants
of the algorithm which do not perform plan recognition, CBQLbase and
CBQLopt; they differ only in their representation of the state.
CBQLbase uses the time step and eight features from the set M
described in Section 4.1 (i.e., the directional movements of the
eight defensive players for the most recent time step). In contrast,
CBQL-PR uses only the predicted cluster and the time step. Before
its third time step, CBQL-PR’s cluster feature takes on a distinct
value indicating no prediction. We compared CBQL-PR with CBQLbase to
examine whether clustering improves CBQL-PR’s performance over RL
alone.

CBQLopt uses an optimized 5-dimensional state description which
includes four real-valued features that are intuitively helpful in
the QB’s decisions of when and where to throw. It uses three
features to indicate each eligible receiver’s distance from the
nearest defensive player (indicating how well each is covered). A
fourth feature denotes the QB’s distance from the closest defensive
player (indicating the likelihood that he will be tackled
imminently). The final feature is the current time step. This state
representation more closely resembles a conventional RL state,
containing features selected for easy disambiguation of the right
action to use, rather than capturing opponent plans. See Table 1 for
more details.
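
As an illustration of the CBQLopt state, the sketch below computes the five features from player coordinates. The coordinate representation, the receiver ordering (RB, WR1, WR2), and the names `nearest_defender_distance` and `opt_state` are assumptions made for this example; the paper does not specify them.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (x, y) field position in yards (assumed)


def nearest_defender_distance(player: Point, defenders: List[Point]) -> float:
    """Distance from a player to the closest defensive player."""
    return min(math.dist(player, d) for d in defenders)


def opt_state(receivers: Dict[str, Point], qb: Point,
              defenders: List[Point], time_step: int) -> Tuple[float, ...]:
    """Five features: three receiver coverages, QB pressure, time step."""
    order = ("RB", "WR1", "WR2")  # assumed feature ordering
    coverage = [nearest_defender_distance(receivers[name], defenders)
                for name in order]
    pressure = nearest_defender_distance(qb, defenders)
    return (*coverage, pressure, float(time_step))
```

A large coverage value suggests an open receiver, while a small pressure value suggests that the QB is about to be tackled.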

5.  Evaluation 

Our empirical study focuses on analyzing how the state
representation affects the performance of a case-based Q(λ)
algorithm on a task in RUSH 2008. We hypothesized that clustering
via plan recognition would yield a state representation that
significantly improves performance. We used the experimentation
platform LIET, the Lightweight Integration and Evaluation Testbed.
LIET is a free tool we developed that can be used to evaluate the
performance of agents on tasks in integrated simulation
environments. LIET managed communication between RUSH 2008 and CBQL,
ran the experiment protocol, and collected results.

Table 1: Summary of the algorithms used in the first experiment.

Algorithm   Plan Recognition?   State and State Example

CBQLbase    ✗                   Instantaneous direction of each
                                defender + time step
                                <Back, None, Back-Right, Back, Left,
                                Left, Back-Left, Forward, 3>

CBQLopt     ✗                   Distance from receiver to nearest
                                defender (3) + distance from QB to
                                nearest defender + time step
                                <1.5, 6.0, 3.0, 0.9, 3>

CBQL-PR     ✓                   Predicted cluster + time step
                                <cluster_5, 1>

Figure 3: Comparison highlighting the utility of our plan
recognition approach for a case-based reinforcement learner on the
task of controlling the Rush 2008 QB’s actions.

We assessed performance in terms of two metrics: asymptotic
advantage and regret. Asymptotic advantage is defined as the
difference between the asymptotic performances of two algorithms,
which we compute by averaging the performance achieved during the
final 10 testing periods. The second metric, regret (Kaelbling et
al., 1996), is the integral difference between the performance
curves of two algorithms. To normalize, the regret is divided by a
bounding box defined by the most extreme values in each dimension
from both curves. The domain metric measured is the total reward, as
defined in Section 2.
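
The two evaluation metrics used in this section can be sketched as follows. This is our own reading of the definitions, not the LIET implementation: we assume both curves are sampled at the same training-trial counts, approximate the integral with the trapezoidal rule, and use the hypothetical names `normalized_regret` and `asymptotic_advantage`.

```python
from typing import Sequence


def normalized_regret(trials: Sequence[float], perf_a: Sequence[float],
                      perf_b: Sequence[float]) -> float:
    """Area between curves A and B divided by their bounding-box area."""
    if not (len(trials) == len(perf_a) == len(perf_b)) or len(trials) < 2:
        raise ValueError("curves must share at least two sample points")
    area = 0.0
    for i in range(1, len(trials)):
        width = trials[i] - trials[i - 1]
        gap_left = perf_a[i - 1] - perf_b[i - 1]
        gap_right = perf_a[i] - perf_b[i]
        area += 0.5 * (gap_left + gap_right) * width  # trapezoid
    values = list(perf_a) + list(perf_b)
    box_width = trials[-1] - trials[0]
    box_height = max(values) - min(values)
    if box_width == 0 or box_height == 0:
        return 0.0
    return area / (box_width * box_height)


def asymptotic_advantage(perf_a: Sequence[float], perf_b: Sequence[float],
                         last_n: int = 10) -> float:
    """Difference of mean performance over the final testing periods."""
    tail_a = perf_a[-last_n:]
    tail_b = perf_b[-last_n:]
    return sum(tail_a) / len(tail_a) - sum(tail_b) / len(tail_b)
```

Under this reading, a normalized regret near 1 would mean that one algorithm stays at the top of the bounding box while the other stays at the bottom for the entire experiment.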

We compared three variants of the learning algorithm, each using one
of the different state representations defined in Section 4.3. In
particular, only CBQL-PR employs a plan recognition method. For each
algorithm, we used the following values for the constants to update
the case base: learning rate α=0.2, discount factor γ=0.999,
exploration parameter ε=0.2, trace decay λ=0.9, neighbor count k=7,
and distance threshold τ=0.001. Both α and ε were decreased
asymptotically to 0 over time.

Each algorithm variant was tested against the same set of five
defensive plays. Each of these plays denotes a different set of
behaviors for the defensive team, and for each play, there is a
distinct optimal offensive strategy. As the environment is
stochastic, a series of actions may produce different rewards if
attempted on successive trials.

We ran ten replications of our experiment for each agent.
Experiments lasted for 100,000 training trials, with a random
defensive strategy selected at the beginning of each training trial.
After every 250 training trials, we tested the algorithm ten times
against each defensive strategy. Each point in Figure 3 is the
average performance over one testing period. On average, CBQL-PR
found more clusters than actual plays; the mean was 6.6. However, no
cluster corresponded to more than one play; rather, multiple
clusters sometimes were found corresponding to the same play. Also,
predictions were found to be accurate 100% of the time in the limit;
each time a particular cluster was predicted, the defense was using
the same play.
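
The property reported here, that no cluster corresponds to more than one play even when several clusters map to the same play, can be checked with a short routine. The sketch below assumes a hypothetical log of (predicted cluster, actual play) pairs; the function names are ours.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Set, Tuple


def cluster_to_plays(
    log: Iterable[Tuple[Hashable, str]]
) -> Dict[Hashable, Set[str]]:
    """Collect the set of actual plays observed for each predicted cluster."""
    mapping: Dict[Hashable, Set[str]] = defaultdict(set)
    for cluster, play in log:
        mapping[cluster].add(play)
    return dict(mapping)


def is_pure(log: Iterable[Tuple[Hashable, str]]) -> bool:
    """True when every cluster was only ever observed with a single play."""
    return all(len(plays) == 1 for plays in cluster_to_plays(log).values())
```

Purity in this sense is what matters to the learner: several clusters mapping to one play only splits that play's experience, whereas one cluster covering two plays would hide the distinction the learner needs.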

The results confirm our hypothesis that plan recognition can
significantly improve the performance of case-based RL on this task.
That is, CBQL-PR significantly outperformed CBQLbase and CBQLopt in
k-step regret (vs. CBQLbase=.Sl%, p<.0001 and vs. CBQLopt=.271,
p<.0001) and asymptotic advantage (vs. CBQLbase=82.9, p<.0001 and
vs. CBQLopt=39.0, p<.0001).

A key distinguishing characteristic between CBQL-PR and its variants
is its smaller state dimensionality (i.e., two rather than five or
nine). To test the hypothesis that this is not the sole reason for
its improved performance, we also evaluated CBQLrandom, a variant
whose first feature is randomly chosen from the interval [1, #
defensive plays], and whose second feature is the time step (results
in Figure 4). CBQL-PR also outperforms this variant by a
statistically significant margin (regret vs. CBQLrandom=.272,
p<.0001; asymptotic advantage vs. CBQLrandom=46.5, p<.0001),
confirming our hypothesis.

Figure 4: When using the same arity for the state representation,
randomly predicting the defensive play performs poorly vs. using
plan recognition’s predictions.

6.  Discussion 

We showed that using recognized plans in the state representation
improves the performance of our case-based reinforcement learning
algorithm on a simulated American football task. We compared the
performance of our algorithm using multiple representations, and the
version using plan recognition achieves the highest asymptotic
performance. It also learns more quickly, achieving the highest
performance found by the runner-up in 10% of the time.

This performance improvement is primarily due to two advantages of
CBQL-PR’s state space formulation. The first is its lower
dimensionality, while the second is that the opponent’s plans, which
are important in explaining variations in performance, are
identified.

While useful, this algorithm does not dominate in all situations.
Other experiments (not discussed in this paper) showed that CBQL-PR
does not outperform CBQLopt against all possible defenses. In
particular, when a single series of actions performs well against
all defenses, CBQLopt performs as well as or better than CBQL-PR.
However, CBQL-PR may perform well in other domains where broader
opponent strategies can be grouped into sets to be understood
better. In future work, we will test CBQL-PR against a larger range
of opponent strategies. We will also extend our work to cover the
full game of American football, including choice of offensive play
with different starting conditions (e.g., distance from goal). Also,
we will investigate learning a more detailed representation of
plays, which will allow us to generalize over similar plays.

7.  Conclusions 

Plan recognition methods can be a powerful ally for machine learning
techniques. We investigated the utility of a clustering algorithm
for distinguishing opponent plans in a multi-agent simulation of
plays from an American football game. By replacing a low-level
feature representation with a learned, accurate prediction of the
opponent’s plan, this type of plan recognition can significantly
increase the performance of a case-based reinforcement learner on an
agent control task. We conjecture that similar approaches can
improve the performance of learning algorithms on a large variety of
tasks, and in particular for tasks that can benefit from the
predictions of other agents’ actions.